Update appliance for stackhpc.openhpc nodegroup/partition changes #666

Merged
sjpb merged 13 commits into main from feat/nodegroups-v1 on May 20, 2025

Conversation

@sjpb sjpb commented May 8, 2025

Update the appliance to cope with stackhpc/ansible-role-openhpc#183, which changed openhpc_slurm_partitions to openhpc_nodegroups and openhpc_partitions.

As an example, a configuration like:

#environments/site/inventory/group_vars/all/openhpc.yml:
openhpc_slurm_partitions:
  - name: standard
    default: 'YES'
    groups:
      - name: largemem
      - name: gpu
        gres:
          - conf: gpu:A40:2
            file: /dev/nvidia[0-1]

would now be represented as:

#environments/site/inventory/group_vars/all/openhpc.yml:
openhpc_nodegroups:
  - name: gpu
    gres:
      - conf: gpu:A40:2
        file: /dev/nvidia[0-1]
openhpc_partitions:
  - name: standard
    default: 'YES'
    nodegroups:
      - largemem
      - gpu

Note that:

  • Arbitrary slurm.conf Node-level parameters can now be included via the optional node_params entry on openhpc_nodegroups elements.
  • The skeleton OpenTofu configuration no longer templates out partition configuration variables. Instead, OpenTofu templates a cluster_compute_groups Ansible variable into the inventory/hosts.yml file. NB: the previously-templated partition configuration file should be removed once OpenTofu configurations have been updated.
  • A default openhpc_nodegroups is calculated in environments/common/inventory/group_vars/all/openhpc.yml. This uses the above variable to create one node group per key in the OpenTofu compute variable. This is usually what is required, but openhpc_nodegroups will need overriding if custom Node-level Slurm parameters/configuration are required (e.g. for GRES) - see the sketch after this list.
  • A default openhpc_partitions is calculated. This includes one partition per node group, plus a rebuild partition if the rebuild group is active. This should not generally be overridden. Instead, normal (non-rebuild) partitions should be modified using openhpc_user_partitions, which has the same format.
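For illustration, a minimal sketch of such an override, reusing the GRES values from the example above; the node_params mapping and its CoreSpecCount value are assumptions added only to show the shape of a Node-level parameter override, not configuration from this PR:

#environments/site/inventory/group_vars/all/openhpc.yml:
openhpc_nodegroups:
  - name: largemem
  - name: gpu
    gres:
      - conf: gpu:A40:2
        file: /dev/nvidia[0-1]
    node_params:        # assumed: a mapping of arbitrary Node-level slurm.conf parameters
      CoreSpecCount: 2  # hypothetical value, for illustration only
openhpc_user_partitions:
  - name: standard
    default: 'YES'
    nodegroups:
      - largemem
      - gpu

Since overriding replaces the calculated default entirely, any node group referenced by a partition needs to be listed here.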

This affects:

  • common environment openhpc config
  • skeleton templating
  • CaaS environment infra creation and Ansible variables
  • stackhpc environment openhpc overrides
  • configuration required to support rebuild, which is now automatic
  • stackhpc.openhpc's validate.yml can now be called from ansible/validate.yml - see the sketch below
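For the last point, a hedged sketch of what invoking the role's validation from ansible/validate.yml could look like; the play target, tag and use of include_role here are assumptions for illustration, not the appliance's actual implementation:

# hypothetical play targeting an assumed 'openhpc' inventory group
- hosts: openhpc
  gather_facts: false
  tags: openhpc
  tasks:
    - name: Validate openhpc configuration
      ansible.builtin.include_role:
        name: stackhpc.openhpc
        tasks_from: validate.yml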

This replaces #665.

sjpb commented May 8, 2025

TODO: test this by adding extra nodes in the .stackhpc environment.
[edit:] Done, see below.

@sjpb sjpb marked this pull request as ready for review May 9, 2025 13:32
@sjpb sjpb requested a review from a team as a code owner May 9, 2025 13:32
@sjpb sjpb changed the title PoC of automating partition/nodegroup config Update appliance for stackhpc.openhpc nodegroup/partition changes May 9, 2025

sjpb commented May 9, 2025

Testing in the .stackhpc environment by adding "extra" nodes:

  1. Default configuration, i.e. just adding nodes into the 2nd compute group:
$ git diff environments/.stackhpc/tofu/main.tf
diff --git i/environments/.stackhpc/tofu/main.tf w/environments/.stackhpc/tofu/main.tf
index 8d78401b..ea58d4b0 100644
--- i/environments/.stackhpc/tofu/main.tf
+++ w/environments/.stackhpc/tofu/main.tf
@@ -84,7 +84,7 @@ module "cluster" {
         # Normally-empty partition for testing:
         extra: {
             nodes: []
-            #nodes: ["extra-0", "extra-1"]
+            nodes: ["extra-0", "extra-1"]
             flavor: var.other_node_flavor
         }
     }
$ ansible login -a sinfo
RL9-login-0 | CHANGED | rc=0 >>
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
extra        up 60-00:00:0      2   idle RL9-extra-[0-1]
standard*    up 60-00:00:0      2   idle RL9-compute-[0-1]

OK.

  2. With the above, changing to an explicitly-configured partition covering both compute groups:
$ git diff environments/.stackhpc/inventory/group_vars/all/openhpc.yml
diff --git i/environments/.stackhpc/inventory/group_vars/all/openhpc.yml w/environments/.stackhpc/inventory/group_vars/all/openhpc.yml
index 5aac5f8a..6dd18e9e 100644
--- i/environments/.stackhpc/inventory/group_vars/all/openhpc.yml
+++ w/environments/.stackhpc/inventory/group_vars/all/openhpc.yml
@@ -1,3 +1,8 @@
 openhpc_config_extra:
   SlurmctldDebug: debug
   SlurmdDebug: debug
+openhpc_user_partitions:
+  - name: hpc
+    nodegroups:
+      - standard
+      - extra
$ ansible-playbook ansible/slurm.yml --tags openhpc
$ ansible login -a sinfo
RL9-login-0 | CHANGED | rc=0 >>
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
hpc*         up 60-00:00:0      4   idle RL9-compute-[0-1],RL9-extra-[0-1]

OK.

  3. With overlapping partitions:
$ git diff environments/.stackhpc/inventory/group_vars/all/openhpc.yml
diff --git i/environments/.stackhpc/inventory/group_vars/all/openhpc.yml w/environments/.stackhpc/inventory/group_vars/all/openhpc.yml
index 5aac5f8a..9235f8e0 100644
--- i/environments/.stackhpc/inventory/group_vars/all/openhpc.yml
+++ w/environments/.stackhpc/inventory/group_vars/all/openhpc.yml
@@ -1,3 +1,11 @@
 openhpc_config_extra:
   SlurmctldDebug: debug
   SlurmdDebug: debug
+openhpc_user_partitions:
+  - name: normal
+    nodegroups:
+      - standard
+  - name: all
+    nodegroups:
+      - standard
+      - extra
$ ansible-playbook ansible/slurm.yml --tags openhpc
$ ansible login -a sinfo
RL9-login-0 | CHANGED | rc=0 >>
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal       up 60-00:00:0      2   idle RL9-compute-[0-1]
all*         up 60-00:00:0      4   idle RL9-compute-[0-1],RL9-extra-[0-1]

OK.

@sjpb sjpb force-pushed the feat/nodegroups-v1 branch 4 times, most recently from a7b5cc1 to 32fb617 on May 9, 2025 15:44
@sjpb sjpb force-pushed the feat/nodegroups-v1 branch from 32fb617 to e471458 on May 9, 2025 15:47

sjpb commented May 9, 2025

WIP: testing on Azimuth as slurm-v21 - running, but need to redeploy Slurm after the last force-push.

@jovial jovial left a comment

Looks sensible to me. Just need to wait for the openhpc role changes to merge so that we can update requirements.yml.

sjpb commented May 13, 2025

Now fully tested on CaaS.

@sjpb sjpb marked this pull request as draft May 13, 2025 11:18

sjpb commented May 13, 2025

Converted to draft because #668 must merge first.

@sjpb sjpb marked this pull request as ready for review May 16, 2025 12:24
@sjpb sjpb requested a review from jovial May 16, 2025 13:40
@sjpb sjpb merged commit b99b1e9 into main May 20, 2025
2 checks passed
@sjpb sjpb deleted the feat/nodegroups-v1 branch May 20, 2025 13:47